{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 使用ModelScope Swift微调Baichuan2模型"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 介绍\n",
    "\n",
    "[Baichuan 2](https://github.com/baichuan-inc/Baichuan2)是[百川智能](https://www.baichuan-ai.com/home)推出的开源大语言模型，采用2.6万亿Tokens的高质量语料进行训练，在多个权威的中文、英文和多语言的通用、领域benchmark上取得了同尺寸最佳的效果。`Baichuan2` 目前发布了7B、13B的Base和Chat版本，支持模型商用。\n",
    "\n",
    "当在特定领域使用大语言模型时，可以通过prompt的方式引导模型，也可以通过在领域数据集上微调训练，从而在领域的任务上获得更好的效果。后者的优点是不依赖于Prompt（可能超过模型的输入长度上限），有更好的推理性能，并且经过微调后，在领域相关任务上有更好的效果。\n",
    "\n",
    "本文将介绍如何在PAI对`Baichuan2`模型完成微调训练。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "## 安装和配置SDK\n",
    "\n",
    "我们需要首先安装PAI Python SDK以运行本示例。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "!python -m pip install --upgrade alipai"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "\n",
    "SDK需要配置访问阿里云服务需要的AccessKey，以及当前使用的工作空间和OSS Bucket。在PAI SDK安装之后，通过在**命令行终端** 中执行以下命令，按照引导配置密钥、工作空间等信息。\n",
    "\n",
    "\n",
    "```shell\n",
    "\n",
    "# 以下命令，请在 命令行终端 中执行.\n",
    "\n",
    "python -m pai.toolkit.config\n",
    "\n",
    "```\n",
    "\n",
    "我们可以通过以下代码验证配置是否已生效。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pai\n",
    "from pai.session import get_default_session\n",
    "\n",
    "print(pai.__version__)\n",
    "\n",
    "sess = get_default_session()\n",
    "\n",
    "# 获取配置的工作空间信息\n",
    "assert sess.workspace_name is not None\n",
    "print(sess.workspace_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 准备训练脚本\n",
    "\n",
    "`ModelScope`提供了[SWIFT(Scalable lightWeight Infrastructure for Fine-Tuning)](https://github.com/modelscope/swift#swiftscalable-lightweight-infrastructure-for-fine-tuning)框架，支持模型的全参数微调，也集成了各种高效微调方法，例如`LoRA`、`QLoRA`等，支持用户对`Baichuan2`、`QWen`、`llama2`等常见的语言进行微调训练。\n",
    "\n",
    "基于[Swift的LLM finetune脚本](https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/src/llm_sft.py)，我们修改了部分逻辑，从而支持用户在PAI的训练作业中使用，主要包括：\n",
    "\n",
    "- 使用PAI预置的`Baichuan2-Base`模型\n",
    "\n",
    "对于热门的社区模型，PAI提供了模型缓存在OSS Bucket上，支持挂载到训练作业，训练脚本可以通过读取本地文件的方式加载获取模型。\n",
    "\n",
    "- 保存模型\n",
    "\n",
    "训练脚本需要将模型保存到指定路径(`/ml/output/model`)，从而将模型保存到用户的OSS Bucket中。\n",
    "\n",
    "- 训练依赖的第三方\n",
    "\n",
    "训练作业将运行在PAI提供的`PyTorch`基础镜像上，我们需要在作业环境中安装`transformers`、`datasets`、`swift`、`xformers`等第三方依赖。PAI训练作业支持使用训练脚本目录下的`requirements.txt`安装第三方依赖。\n",
    "\n",
    "\n",
    "完整的训练脚本请参考 `train_src` 目录下的训练文件(`llm_sft.py`)。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 提交训练作业\n",
    "\n",
    "使用提交任务的方式训练模型，能够支持用户并行运行多个训练任务，高效得探索不同的超参组合对于模型性能影响，并且能够支持分布式训练。通过PAI Python SDK提供的`Estimator`API，我们可以方便得将一个本地训练脚本提交到PAI上运行。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "我们将通过以下代码配置训练作业脚本、作业启动命令、使用的作业镜像，以及机器实例规格，提交训练作业。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pai.image import retrieve\n",
    "from pai.estimator import Estimator\n",
    "\n",
    "# 训练作业启动命令\n",
    "# 完整的参数说明请参考文档：https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/README_CN.md#sftsh-%E5%91%BD%E4%BB%A4%E8%A1%8C%E5%8F%82%E6%95%B0\n",
    "command = \"\"\"CUDA_VISIBLE_DEVICES=0 \\\n",
    "python llm_sft.py \\\n",
    "    --model_type baichuan2-7b \\\n",
    "    --sft_type lora \\\n",
    "    --template_type default-generation \\\n",
    "    --dtype fp16 \\\n",
    "    --output_dir /ml/output/model/ \\\n",
    "    --dataset advertise-gen \\\n",
    "    --train_dataset_sample 20000 \\\n",
    "    --num_train_epochs 1 \\\n",
    "    --max_length 2048 \\\n",
    "    --quantization_bit 4 \\\n",
    "    --lora_rank 8 \\\n",
    "    --lora_alpha 32 \\\n",
    "    --lora_dropout_p 0. \\\n",
    "    --lora_target_modules ALL \\\n",
    "    --gradient_checkpointing true \\\n",
    "    --batch_size 16 \\\n",
    "    --weight_decay 0. \\\n",
    "    --learning_rate 1e-4 \\\n",
    "    --gradient_accumulation_steps 4 \\\n",
    "    --max_grad_norm 0.5 \\\n",
    "    --warmup_ratio 0.03 \\\n",
    "    --eval_steps 100 \\\n",
    "    --save_steps 100 \\\n",
    "    --save_total_limit 2 \\\n",
    "    --logging_steps 10\n",
    "\"\"\"\n",
    "\n",
    "\n",
    "# 配置训练作业\n",
    "est = Estimator(\n",
    "    source_dir=\"train_src/\",  # 代码目录\n",
    "    image_uri=retrieve(\"PyTorch\", framework_version=\"latest\").image_uri,  # 训练作业使用的镜像\n",
    "    command=command,  # 训练启动命令\n",
    "    instance_type=\"ecs.gn6e-c12g1.3xlarge\",  # 使用的机器规格示例，V100(32G)\n",
    "    instance_count=1,  # 机器实例个数\n",
    "    base_job_name=\"baichuan2_finetune\",  # 训练作业名称\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "PAI提供了预置的`Baichuan2-Base`模型，可以通过以下方式获取对应的模型`OSS Bucket`路径。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pai.model import RegisteredModel\n",
    "\n",
    "# 获取PAI提供的Baichuan2-7B-Base模型\n",
    "m = RegisteredModel(\n",
    "    model_name=\"baichuan-inc/Baichuan2-7B-Base\", model_provider=\"huggingface\"\n",
    ")\n",
    "\n",
    "# 模型地址\n",
    "print(m.model_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "提交训练作业，等待作业完成。用户可以通过打印的作业详情页URL，查看训练作业进度，资源使用，日志等信息。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "notebookRunGroups": {
     "groupValue": "2"
    }
   },
   "outputs": [],
   "source": [
    "# 提交训练作业\n",
    "est.fit(\n",
    "    inputs={\n",
    "        # 训练代码可以从 /ml/input/data/pretrained_model/ 目录下读取挂载的预训练模型\n",
    "        \"pretrained_model\": m.model_data,\n",
    "    },\n",
    "    wait=False,  # 是否等待训练作业完成\n",
    ")\n",
    "\n",
    "# 打开一个TensorBoard，监控训练作业\n",
    "tb = est.tensorboard()\n",
    "\n",
    "\n",
    "# 等待训练作业完成\n",
    "est.wait()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "训练作业写出到 `/ml/output/model` 目录下的模型文件和checkpoints将被保存到用户的OSS Bucket中，可以通过 `est.model_data()` 获取 OSS Bucket路径。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 查看数据模型的OSS Bucket路径\n",
    "print(est.model_data())\n",
    "\n",
    "\n",
    "# 删除启动的TensorBoard（每一个账号下最多能够启动5个TensorBoard示例）\n",
    "tb.delete()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 结语\n",
    "\n",
    "在当前示例中，我们展示了如何基于`ModelScope Swift`框架，使用PAI预置的`Baichuan2-Base`模型，完成`Baichuan2`模型的微调训练。用户可以参考以上的示例，修改脚本，使用用户自定义的数据集，或是修改使用的基础预训练模型，完成自定义语言模型的微调训练。\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "pai-dev-py38",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
